Theory and Practice
Key Challenge: Financial data is limited, non-stationary, and path-dependent, making proper validation fundamentally different from other fields like machine learning on image or text data.
This form of statistical inflation is called selection bias under multiple testing (SBuMT)
Defining Backtest Overfitting: When the process of backtest optimization leads to strategies that fit the noise in historical data rather than genuine market inefficiencies.
| Performance Statistic | Description |
|---|---|
| PnL | The total amount of dollars (or the equivalent in the currency of denomination) generated over the entirety of the backtest, including liquidation costs from the terminal position. |
| PnL from long positions | The portion of the PnL dollars that was generated exclusively by long positions. |
| Annualized rate of return | The time-weighted average annual rate of total return, including dividends, coupons, costs, etc. |
| Hit ratio | The fraction of bets that resulted in a positive PnL. |
| Average return from hits | The average return from bets that generated a profit. |
| Average return from misses | The average return from bets that generated a loss. |
We can use the PortfolioAnalytics package to chart the performance of our competing strategies; the drawdown statistics are shown in the bottom panel.
Efficiency statistics provide a relative analysis of the performance of a backtest.
precision is the estimated probability that a strategy randomly selected from the pool of all positive backtests is a true strategy.
recall (or true positive rate) is the estimated probability that a strategy randomly selected from the pool of true strategies has a positive backtest.
Under the standard Neyman-Pearson [1933] hypothesis testing framework:
\[s = s_T + s_F\]
where \(s_T\) is the number of true strategies and \(s_F\) is the number of false strategies. Let \(\theta = s_T/s_F\) denote the odds ratio of true to false strategies. Then
\[s_T = s\frac{\theta}{1+\theta}\]
\[s_F = s - s_T = s\left(1-\frac{\theta}{1+\theta}\right) = s\frac{1}{1+\theta}\]
\[\text{precision}=\frac{TP}{TP+FP} = \frac{(1-\beta)s_T}{(1-\beta)s_T+\alpha s_F} = \frac{(1-\beta)s\frac{\theta}{1+\theta}}{(1-\beta)s\frac{\theta}{1+\theta}+\alpha s\frac{1}{1+\theta}}=\frac{(1-\beta)\theta}{(1-\beta)\theta+\alpha}\]
\[\text{recall}=\frac{TP}{(TP+FN)}=\frac{(1-\beta)s_T}{(1-\beta)s_T+\beta s_T}=1-\beta\]
This highlights a well-known pitfall: p-values report a rather uninformative probability. A statistical test can have high confidence (low p-value) and yet low precision.
In particular, a strategy is more likely false than true if \((1-\beta)\theta < \alpha\) such that precision is less than 50%.
\[FDR=\frac{FP}{FP+TP}=\frac{\alpha}{(1-\beta)\theta+\alpha}=1-\text{precision}\]
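The tibble output shown just below can be reproduced in a few lines of R; the parameter values (\(\alpha = 0.05\), \(\beta = 0.2\), \(\theta = 1/99\), i.e., one true strategy per 99 false ones) are assumptions reverse-engineered to match that output:

```r
library(tibble)

# Assumed illustrative parameters: 5% significance, 80% power,
# and one true strategy for every 99 false ones
alpha <- 0.05
beta  <- 0.2
theta <- 1 / 99

recall    <- 1 - beta
precision <- recall * theta / (recall * theta + alpha)
fdr       <- 1 - precision

tibble(Recall = recall, Precision = precision, FDR = fdr)
```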
# A tibble: 1 × 3
  Recall Precision   FDR
   <dbl>     <dbl> <dbl>
1    0.8     0.139 0.861
When Neyman and Pearson [1933] proposed this framework, they did not consider the possibility of conducting multiple tests and selecting the best outcome.
When a test is repeated multiple times, the combined \(\alpha\) increases.
Consider that we repeat for a second time a test with false positive probability \(\alpha\).
At each trial, the probability of not making a Type I error is \(1-\alpha\)
If the two trials are independent, the probability of not making a Type I error on the first and second tests is \((1-\alpha)^2\)
The probability of making at least one Type I error is the complementary, \(1-(1-\alpha)^2\)
After a family of \(K\) independent tests, the probability of making no Type I error is \((1-\alpha)^K\)
The family-wise error rate (FWER) is the probability that at least one of the positives is false, \(\alpha_K=1-(1-\alpha)^K\)
The Sidak correction: for a target family-wise rate \(\alpha_K\) over \(K\) tests, use the per-test level \(\alpha=1-(1-\alpha_K)^{1/K}\)
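Both relationships are one-liners in R; a minimal sketch (the function names are ours):

```r
# Family-wise error rate after K independent tests at per-test level alpha
fwer <- function(alpha, K) 1 - (1 - alpha)^K

# Sidak correction: per-test level needed to keep the family-wise rate at alpha_K
sidak_alpha <- function(alpha_K, K) 1 - (1 - alpha_K)^(1 / K)

fwer(0.05, 10)        # ~0.401: ten trials at 5% give a ~40% chance of a false positive
sidak_alpha(0.05, 10) # ~0.0051: per-test level that keeps FWER at 5%
```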
However, in the context of finance the FDR is preferred: an investor does not typically allocate funds to all predicted positives within a family of trials, a proportion of which are likely to be false.
Instead, investors are only introduced to the single best strategy out of a family of thousands or even millions of alternatives
Investors have no ability to invest in the discarded predicted positives.
Following the car analogy: in finance there is effectively a single car produced per model, which everyone will use. If that only unit is defective, everyone will crash.
Selection bias under multiple backtesting makes it impossible to assess, from the reported backtest alone, the probability that a strategy is false.
Lopez de Prado (2018) argues that this explains why most quantitative investment firms fail: they are likely investing in false positives.
This is because most financial analysts assess performance with the Sharpe ratio, not precision and recall.
Lopez de Prado (2020) develops a framework to assess the probability that a strategy is false, using the Sharpe ratio estimate and metadata from the discovery process as inputs.
\[ r_t \sim N(\mu,\sigma^2)\]
\[SR=\frac{\mu}{\sigma}\]
\[\widehat{SR}=\frac{\hat{E}(r_t)}{\sqrt{\hat{V}(r_t)}}\]
\[(\widehat{SR}-SR) \overset{a}{\sim} N \left[0,\frac{1+0.5SR^2}{T}\right]\]
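A quick Monte Carlo check of this asymptotic variance under the IID normal assumption (the parameter values here are arbitrary):

```r
set.seed(1)
T_ <- 2000  # sample length
SR <- 0.1   # true per-period Sharpe ratio (unit sigma, so mu = SR)

# Sampling distribution of the SR estimator over 5000 simulated histories
sr_hat <- replicate(5000, {
  r <- rnorm(T_, mean = SR, sd = 1)
  mean(r) / sd(r)
})

var(sr_hat) * T_  # should be close to 1 + 0.5 * SR^2 = 1.005
```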
Subsequent evidence showed that hedge fund returns exhibit substantial negative skewness and positive excess kurtosis,
the implication being that assuming IID normal returns will grossly underestimate the false positive probability.
Elmar Mertens then derived an asymptotic distribution for \(\widehat{SR}\) whose variance term incorporates skewness and kurtosis.
\[\widehat{PSR}(SR^*)=Z\left[\frac{(\widehat{SR}-SR^*)\sqrt{T-1}}{\sqrt{1-\hat{\gamma}_3\widehat{SR}+\frac{\hat{\gamma}_4-1}{4}\widehat{SR}^2}}\right]\]
where \(Z[\cdot]\) is the standard Gaussian CDF, \(\hat{\gamma}_3\) is the sample skewness, and \(\hat{\gamma}_4\) the sample kurtosis of the returns.
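A direct R transcription of the PSR formula; this helper, psr(), is our own sketch rather than library code, and the moment estimators are the simple plug-in ones:

```r
# Probabilistic Sharpe Ratio: probability the true SR exceeds a benchmark SR*,
# accounting for skewed, fat-tailed returns
psr <- function(returns, sr_benchmark = 0) {
  T_ <- length(returns)
  sr <- mean(returns) / sd(returns)
  z  <- (returns - mean(returns)) / sd(returns)
  g3 <- sum(z^3) / T_  # skewness estimate
  g4 <- sum(z^4) / T_  # raw kurtosis estimate (3 for a normal)
  stat <- (sr - sr_benchmark) * sqrt(T_ - 1) /
    sqrt(1 - g3 * sr + (g4 - 1) / 4 * sr^2)
  pnorm(stat)
}

set.seed(42)
psr(rnorm(1000, mean = 0.2, sd = 1))  # high: a genuine SR of 0.2 over 1000 periods
```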
\[E\left[\underset{k}{\max}(\widehat{SR}_k)\right] \approx \sqrt{V(\widehat{SR}_k)}\left((1-\gamma)Z^{-1} \left[1-\frac{1}{K}\right]+\gamma Z^{-1}\left[1-\frac{1}{Ke}\right]\right)\]
where \(Z^{-1}\) is the inverse of the standard Gaussian CDF, \(e\) is Euler’s number, and \(\gamma\) is the Euler-Mascheroni constant; the approximation holds under the null that the true Sharpe ratio of every trial is zero.
Corollary: Unless \(\underset{k}{\max}(\widehat{SR}_k) \gg E[\underset{k}{\max}(\widehat{SR}_k)]\), the discovered strategy is likely a false positive.
But \(E(\underset{k}{\max}(\widehat{SR_k}))\) is usually unknown, ergo SR is dead.
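The simulation code below calls a helper getExpectedMaxSR() that evaluates the closed-form approximation above; its definition is not shown in the source, so here is a minimal sketch of what it presumably looks like:

```r
# Expected maximum Sharpe ratio across nTrails trials under the null of zero
# true SR, from the Bailey / Lopez de Prado approximation
getExpectedMaxSR <- function(nTrails, meanSR, stdSR) {
  emc <- -digamma(1)  # Euler-Mascheroni constant, ~0.5772
  sr0 <- (1 - emc) * qnorm(1 - 1 / nTrails) +
    emc * qnorm(1 - 1 / (nTrails * exp(1)))
  meanSR + stdSR * sr0
}

getExpectedMaxSR(1000, meanSR = 0, stdSR = 1)  # ~3.25: the "hurdle" after 1000 trials
```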
getDistMaxSR <- function(nSims, nTrails, meanSR, stdSR) {
  out <- tibble()
  for (nTrails_ in nTrails) {
    # 1) Simulate Sharpe ratios: nSims simulations of nTrails_ trials each
    set.seed(nTrails_)
    sr <- array(rnorm(nSims * nTrails_), dim = c(nSims, nTrails_))
    sr <- apply(sr, 1, scale)  # demean and rescale each simulation (result: nTrails_ x nSims)
    sr <- meanSR + sr * stdSR
    # 2) Store the maximum SR across trials for each simulation
    out <- out %>% bind_rows(
      tibble("Max{SR}" = apply(sr, 2, max), "nTrails" = nTrails_))
  }
  return(out)
}
library(pracma)
# Create a sequence of trial counts on a log-linear scale
nTrails<-as.integer(logspace(1,4,100)) %>% unique()
plot(nTrails)
sr0=array(dim = length(nTrails))
for (i in seq_along(nTrails)) {
  sr0[i] <- getExpectedMaxSR(nTrails[i], meanSR = 0, stdSR = 1)
}
sr1 <- getDistMaxSR(nSims = 1000, nTrails = nTrails, meanSR = 0, stdSR = 1)

\[\widehat{DSR} \equiv \widehat{PSR}(\widehat{SR}_0)=Z \left[\frac{(\widehat{SR}-E[\max_k(\widehat{SR}_k)])\sqrt{T-1}}{\sqrt{1-\hat{\gamma}_3\widehat{SR}+\frac{\hat{\gamma}_4-1}{4}\widehat{SR}^2}}\right]\]
\(\widehat{DSR}\) can be interpreted as the probability of observing a Sharpe ratio greater than or equal to \(\widehat{SR}\) under the null hypothesis that the true Sharpe ratio is zero, while adjusting for skewness \(\gamma_3\), kurtosis \(\gamma_4\), sample length, and multiple testing.
Calculating \(\widehat{DSR}\) requires estimating \(E[\max_k(\widehat{SR}_k)]\), which in turn requires estimates of \(K\) and \(V(\widehat{SR}_k)\); this is where financial machine learning can help.
Specifically, we use clustering, choosing the optimal number of clusters, to estimate \(K\), the effective number of trials, and then compute the variance across those trials.
Low Risk of Overfitting
High Risk of Overfitting
The risk of false discovery rises sharply with the number of configurations tested
Critical Result: Even with a 5% significance level, when the odds ratio is low (common in finance), the false discovery rate can exceed 80%.
Best Practice: Document and store all trials, not just the successful ones.
Performance Metrics
Robustness Checks
A comprehensive evaluation requires both performance metrics and robustness checks
Key Question: How can we develop more robust methods for strategy validation in an era of increasing data availability and computational power?

# Generate correlated strategy returns
set.seed(42)
n_strategies <- 50
n_returns <- 252
base_returns <- matrix(rnorm(10 * n_returns), nrow = n_returns)
# Create strategies with varying correlations to base returns
strategies_returns <- matrix(0, nrow = n_returns, ncol = n_strategies)
for (i in 1:n_strategies) {
  # Mix of base returns and unique noise
  weight <- runif(1, 0.3, 0.9)
  base_idx <- sample(1:10, 1)
  strategies_returns[, i] <- weight * base_returns[, base_idx] +
    (1 - weight) * rnorm(n_returns)
}
# Calculate correlation matrix
cor_matrix <- cor(strategies_returns)
# Convert to distance matrix
dist_matrix <- as.dist(1 - abs(cor_matrix))
# Hierarchical clustering
hc <- hclust(dist_matrix, method = "complete")
# Plot dendrogram
plot(hc, main = "Hierarchical Clustering of Strategy Returns",
     xlab = "", sub = "", cex = 0.6)
rect.hclust(hc, k = 8, border = "red")
# Convert strategies_returns to a data frame
strategies_df <- as.data.frame(strategies_returns)
colnames(strategies_df) <- paste0("Strategy", 1:ncol(strategies_df))
# Calculate optimal number of clusters using the silhouette method
# Note: fviz_nbclust() comes from the factoextra package; we use the
# data frame directly instead of the distance matrix
library(factoextra)
fviz_nbclust(strategies_df, FUN = hcut, method = "silhouette",
             k.max = 15) +
  labs(title = "Optimal Number of Clusters",
       subtitle = "Using Silhouette Method")
# Cut tree at optimal number
k_opt <- 8 # Based on silhouette plot
clusters <- cutree(hc, k = k_opt)
# Show the first few strategies and their cluster assignments
head(tibble(Strategy = 1:n_strategies, Cluster = clusters), 10)
# A tibble: 10 × 2
   Strategy Cluster
      <int>   <int>
 1        1       1
 2        2       2
 3        3       3
 4        4       4
 5        5       4
 6        6       5
 7        7       6
 8        8       6
 9        9       7
10       10       3
The effective number of independent trials (\(K_{eff}\)) is approximately equal to the optimal number of clusters when strategies are grouped by similarity in returns.
# Calculate the number of strategies in each cluster
cluster_sizes <- table(clusters)
# Estimate effective number of trials
k_eff <- length(cluster_sizes)
# Display results
cat("Total strategies tested:", n_strategies, "\n")
cat("Effective number of independent trials:", k_eff, "\n")
cat("Reduction factor:", n_strategies / k_eff, "\n")
Total strategies tested: 50
Effective number of independent trials: 8
Reduction factor: 6.25
# Function to calculate DSR
calculate_dsr <- function(strategy_returns, n_effective_trials,
                          mean_sr = 0, sr_variance = NULL) {
  # Calculate Sharpe ratio and its components
  n <- length(strategy_returns)
  sr <- mean(strategy_returns) / sd(strategy_returns)
  # Calculate skewness and excess kurtosis
  z <- (strategy_returns - mean(strategy_returns)) / sd(strategy_returns)
  skew <- sum(z^3) / n
  kurt <- sum(z^4) / n - 3 # Excess kurtosis
  # If SR variance not provided, estimate it
  if (is.null(sr_variance)) {
    sr_variance <- 1 # Simplification for example
  }
  # Calculate expected maximum SR
  emc <- -digamma(1) # Euler-Mascheroni constant, ~0.5772
  exp_max_sr <- (1 - emc) * qnorm(p = 1 - 1 / n_effective_trials) +
    emc * qnorm(1 - (n_effective_trials * exp(1))^(-1))
  exp_max_sr <- mean_sr + sqrt(sr_variance) * exp_max_sr
  # DSR calculation; (kurt + 2) / 4 equals (gamma_4 - 1) / 4 with raw kurtosis gamma_4
  numerator <- (sr - exp_max_sr) * sqrt(n - 1)
  denominator <- sqrt(1 - skew * sr + ((kurt + 2) / 4) * sr^2)
  dsr <- pnorm(numerator / denominator)
  return(list(
    sharpe_ratio = sr,
    expected_max_sr = exp_max_sr,
    dsr = dsr
  ))
}
# Calculate DSR for a sample strategy
sample_strategy <- strategies_returns[, 1]
dsr_results <- calculate_dsr(
  sample_strategy,
  n_effective_trials = k_eff
)
# Display results
cat("Strategy Sharpe Ratio:", round(dsr_results$sharpe_ratio, 4), "\n")
cat("Expected Max SR with", k_eff, "trials:", round(dsr_results$expected_max_sr, 3), "\n")
cat("Deflated Sharpe Ratio:", round(dsr_results$dsr, 4), "\n")
Strategy Sharpe Ratio: 0.0864
Expected Max SR with 8 trials: 1.459
Deflated Sharpe Ratio: 0
# Function to calculate precision and FDR
calculate_precision_fdr <- function(theta, alpha = 0.05, beta = 0.2) {
  recall <- 1 - beta
  b1 <- recall * theta
  precision <- b1 / (b1 + alpha)
  fdr <- 1 - precision
  return(c(precision = precision, fdr = fdr))
}
# Calculate precision and FDR for different theta values
theta_values <- seq(0.001, 0.5, by = 0.001)
results <- t(sapply(theta_values, calculate_precision_fdr))
results_df <- tibble(
  theta = theta_values,
  precision = results[, "precision"],
  fdr = results[, "fdr"]
)
# Plot
ggplot(results_df, aes(x = theta)) +
  geom_line(aes(y = precision, color = "Precision"), size = 1) +
  geom_line(aes(y = fdr, color = "FDR"), size = 1) +
  scale_color_manual(values = c("Precision" = "blue", "FDR" = "red")) +
  labs(
    title = "Precision and False Discovery Rate vs. Odds Ratio",
    subtitle = "Alpha = 0.05, Beta = 0.2",
    x = "Theta (Odds Ratio of True vs. False Strategies)",
    y = "Rate",
    color = "Metric"
  ) +
  theme_minimal() +
  geom_vline(xintercept = 0.05 / 0.8, linetype = "dashed") +
  annotate("text", x = 0.07, y = 0.5,
           label = "Precision = 50%\nwhen θ = α/(1-β)")

CPCV provides a framework for model selection that:

- Purges training observations that overlap with test observations
- Embargoes observations that follow test observations
- Generates multiple train/test splits to assess model variance
Walk-Forward Testing
Combinatorial Purged CV
# This is pseudocode for demonstration purposes
implement_cpcv <- function(returns, feature_data, model_func,
                           n_splits = 5, purge_window = 20, embargo = 5) {
  # Define time indices
  T <- length(returns)
  indices <- 1:T
  # Create time-based folds
  fold_size <- floor(T / n_splits)
  folds <- list()
  for (i in 1:n_splits) {
    test_start <- (i - 1) * fold_size + 1
    test_end <- min(i * fold_size, T)
    test_indices <- test_start:test_end
    # Apply purging: remove training observations that overlap with the test set
    purge_before <- max(1, test_start - purge_window)
    purge_after <- min(T, test_end + purge_window)
    purge_indices <- purge_before:purge_after
    # Apply embargo: remove training observations that immediately follow the test set
    embargo_end <- min(T, test_end + embargo)
    embargo_indices <- if (test_end < T) (test_end + 1):embargo_end else integer(0)
    # Training indices are all indices except test, purge, and embargo
    train_indices <- setdiff(indices,
                             unique(c(test_indices, purge_indices, embargo_indices)))
    folds[[i]] <- list(train = train_indices, test = test_indices)
  }
  # Run the model on each fold
  results <- list()
  for (i in seq_along(folds)) {
    train_data <- feature_data[folds[[i]]$train, ]
    train_returns <- returns[folds[[i]]$train]
    test_data <- feature_data[folds[[i]]$test, ]
    test_returns <- returns[folds[[i]]$test]
    # Train model
    model <- model_func(train_data, train_returns)
    # Predict on test data
    predictions <- predict(model, test_data)
    # Evaluate performance (evaluate_performance is assumed to be defined elsewhere)
    performance <- evaluate_performance(predictions, test_returns)
    results[[i]] <- performance
  }
  return(results)
}

DSR Thresholds for Strategy Selection
Implementation Considerations
| Strategy | Sharpe | DSR | Decision |
|---|---|---|---|
| Strategy A | 1.80 | 0.25 | Reject |
| Strategy B | 2.20 | 0.55 | Further Testing |
| Strategy C | 2.00 | 0.65 | Further Testing |
| Strategy D | 1.95 | 0.98 | Accept |
Meta-labeling separates the problem of side prediction (buy/sell) from the problem of bet sizing.
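A minimal sketch of the idea in R, on simulated data; the data-generating process, variable names, and the logistic secondary model are all illustrative assumptions:

```r
set.seed(7)
n      <- 500
signal <- rnorm(n)                 # primary model's raw signal
ret    <- 0.3 * signal + rnorm(n)  # simulated returns, weakly driven by the signal

side <- sign(signal)               # primary model: predicts the side only (buy/sell)
hit  <- as.integer(side * ret > 0) # meta-label: 1 if the side call was profitable

# Secondary model: learns when the primary model tends to be right
meta     <- glm(hit ~ abs(signal), family = binomial)
bet_size <- predict(meta, type = "response")  # success probability, usable as bet size

range(bet_size)  # sizes stay in [0, 1]; stronger signals earn larger bets
```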
Benefits
Implementation
Bayesian Sharpe Ratio
Information-Theoretic Approaches
Regularization Techniques
Example: AIC for Strategy Selection
| Strategy | Parameters | Sharpe Ratio | AIC | Adjusted Sharpe |
|---|---|---|---|---|
| Moving Average Crossover | 2 | 1.2 | 120 | 1.130685 |
| Bollinger Band Strategy | 5 | 1.5 | 150 | 1.339056 |
| Multi-factor Model | 12 | 1.8 | 210 | 1.551509 |
| Deep Neural Network | 150 | 2.1 | 350 | 1.598936 |
Best Practices:
AQR Capital Management: One of the pioneers in addressing backtest overfitting
Cliff Asness (AQR co-founder): “We aim to publish strategies with high out-of-sample Sharpe ratios, not just high backtest Sharpe ratios.”
AQR’s Approach: - Long out-of-sample periods (often decades) - Focus on economically justified factors - Implementation across multiple asset classes - Transparency in methodology - Publication of research and results
False Discovery Estimation
Robust Strategy Evaluation Framework
AI and Trading